We consider the best-arm identification problem in multi-armed bandits, which focuses purely on exploration. A player is given a fixed budget to explore a finite set of arms, and the rewards of each arm are drawn independently from a fixed, unknown distribution. The player aims to identify the arm with the largest expected reward. We propose a general framework to unify sequential elimination algorithms, where the arms are dismissed iteratively until a unique arm is left. Our analysis reveals a novel performance measure expressed in terms of the sampling mechanism and the number of eliminated arms at each round. Based on this result, we develop an algorithm that divides the budget according to a nonlinear function of the remaining arms at each round. We provide theoretical guarantees for the algorithm, characterizing the suitable nonlinearity for different problem environments described by the number of competitive arms. Matching the theoretical results, our experiments show that the nonlinear algorithm outperforms the state-of-the-art. We finally study the side-observation model, where pulling an arm reveals the rewards of its related arms, and we establish improved theoretical guarantees in the pure-exploration setting.
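To make the sequential-elimination idea concrete, the following is a minimal sketch: rounds proceed until a single arm survives, each round spends a slice of the budget proportional to a nonlinear function of the number of surviving arms, samples the survivors equally, and eliminates the arm with the smallest empirical mean. The `power` knob here is a hypothetical stand-in for the paper's nonlinearity (not its actual allocation rule), and the function names are illustrative, not from the paper.

```python
import random

def nonlinear_elimination(reward_fns, budget, power=1.0, seed=0):
    """Sequential-elimination sketch.

    reward_fns: list of callables, reward_fns[i](rng) -> a reward sample for arm i.
    budget:     total number of arm pulls to spread over the rounds.
    power:      hypothetical nonlinearity; the round with m survivors gets a
                budget share proportional to m**power.
    Returns the index of the single surviving arm.
    """
    rng = random.Random(seed)
    k = len(reward_fns)
    survivors = list(range(k))
    sums, counts = [0.0] * k, [0] * k
    # One round per elimination: survivor counts go k, k-1, ..., 2.
    weights = [m ** power for m in range(k, 1, -1)]
    total_w = sum(weights)
    for m, w in zip(range(k, 1, -1), weights):
        # Split this round's budget slice equally over the m survivors.
        pulls_per_arm = max(1, int(budget * w / total_w) // m)
        for arm in survivors:
            for _ in range(pulls_per_arm):
                sums[arm] += reward_fns[arm](rng)
                counts[arm] += 1
        # Dismiss the arm with the lowest empirical mean.
        worst = min(survivors, key=lambda a: sums[a] / counts[a])
        survivors.remove(worst)
    return survivors[0]
```

With `power=1.0` the budget slice shrinks linearly with the number of remaining arms; raising or lowering `power` skews sampling toward early rounds (many arms) or late rounds (few, competitive arms), which is the kind of trade-off the abstract's nonlinearity is meant to tune.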